Diabetes mellitus (DM) represents a major public health concern and is closely associated with factors such as obesity, age, race, and gender. Identifying these risk factors is essential for developing targeted and effective intervention strategies D’Angelo and Ran (2025). Logistic regression, which estimates the association between risk factors and a binary outcome (presence or absence of diabetes), is the standard analytical approach, but it is often insufficient for the complexity of healthcare data: DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments Zeger et al. (2020). In addition, classical maximum likelihood estimation (MLE) yields unstable results in small samples, with missing data, or under quasi- or complete separation.
A Bayesian hierarchical model fitted with Markov chain Monte Carlo (MCMC) suits multivariate longitudinal healthcare data with two levels: (1) repeated measures over time within individuals and (2) individuals nested within a population. By integrating prior knowledge with exogenous covariates (e.g., age, clinical history) and endogenous covariates (e.g., current treatment), MCMC estimation yields posterior and marginal distributions of the parameters that support risk prediction (e.g., pneumonia, prostate cancer, and mental disorders). The model’s main limitation is its parametric nature.
In Bayesian inference Chatzimichail and Hatjimihail (2023), comparisons between parametric models (with fixed sets of parameters) and non-parametric models (which make no prior assumptions about the distribution) applied to the National Health and Nutrition Examination Survey demonstrated the use of posterior probabilities to classify diseases. In contrast, conventional approaches that rely on fixed clinical thresholds and dichotomous classifications often fail to capture the complex relationships between diagnostic tests and disease prevalence. These traditional methods tend to oversimplify the underlying heterogeneity across populations and perform poorly when dealing with skewed, bimodal, or multimodal data distributions.
Bayesian nonparametric modeling is a flexible, adaptable, and robust alternative to traditional parametric approaches: it captures complex and multimodal data patterns, and integrating priors with multiple diagnostic tests enhances diagnostic accuracy and precision. However, the approach can be limited by restricted access to scholarly resources and excessive reliance on prior information. Combined with other statistical and computational techniques, Bayesian nonparametric models can further strengthen diagnostic performance by addressing systemic bias, incomplete data, and non-representative or non-normal datasets Chatzimichail and Hatjimihail (2023).
The Bayesian framework described by Schoot et al. (2021) highlights the critical role of priors, data modeling, inference, model checking, posterior sampling, variational inference, and variable selection in applications across social sciences, ecology, genetics, and medicine. Variable selection is important because multicollinearity, limited sampling, and overfitting can lead to weak predictive performance and reduced interpretability. Based on the degree of uncertainty in hyperparameters, priors are classified as informative, weakly informative, or diffuse (larger variances indicating greater uncertainty).
Prior elicitation may be based on expert opinion, generic knowledge, or data-driven approaches using maximum likelihood estimates or sample statistics. Prior sensitivity analysis assesses the alignment between priors and the likelihood. Priors improve estimation via data-informed shrinkage and regularization, and they aid algorithms by focusing sampling on high-density regions. Specifying prior information strengthens inference in small or less informative samples by integrating observed data with uncertain parameters, generating a range of possible values through the likelihood function. MCMC simulation (e.g., brms and blavaan in R) allows empirical estimation of posterior distributions. Spatial and temporal Bayesian models analyze large-scale cancer genomics: they identify novel molecular interactions, detect mutational signatures, stratify patients genomically for clinical trials, and offer insights into cancer evolution, but their application is limited by temporal autocorrelation and the inherent subjectivity of prior elicitation.
Bayesian normal linear regression has been applied in metrology for instrument calibration using parametric (conjugate) models with a Normal–Inverse-Gamma (NIG) prior Klauenberg et al. (2015). As in Gaussian regression, errors are independent and identically distributed with unknown variance; the relationship between X and Y is statistical, and because of the associated noise and model uncertainty it cannot be treated as a direct measurement function. The Bayesian method accounts for uncertainty by treating both observables (data) and unobservables (parameters and auxiliary variables) as random and assigning probability distributions to the unobservables. Incorporating a prior enhances interpretability and robustness. The conjugate NIG prior accommodates vague or non-informative specifications, and hierarchical priors add further flexibility by modeling uncertainty across multiple levels, capturing complex data nuances.
Bayesian hierarchical (meta-analytic) linear regression augments the data by incorporating both exchangeable and unexchangeable information on parameters, addressing the multiple-testing problem and the low statistical power that arise when separate significance tests are conducted across studies with different predictors. Adding priors derived from meta-analysis addresses small sample sizes with limited availability of previous articles, and resolves the limitations of univariate analyses and the relationships among multiple regression parameters within a study. Priors are (1) exchangeable when the current data and previous studies share the same set of predictors and (2) unexchangeable when the predictors differ. The hierarchical unexchangeable model is suited to studying between-study differences, enabling explicit testing of the exchangeability assumption. The approach is limited by the correlation between identical sets of predictors Leeuw and Klugkist (2012).
A sequential clinical reasoning model Liu et al. (2013) was demonstrated for screening adults (20–79 years, Taiwan), addressing the limited availability of molecular information by leveraging routinely collected biological markers to classify disease. Predictors were added sequentially in three models: (1) demographic features (basic model), (2) six metabolic components (metabolic score model), and (3) conventional risk factors (enhanced model). Incorporating priors and emulating a clinician’s evaluation process, the model assumes normally distributed regression coefficients, accounts for uncertainty in clinical weights, and averages credible intervals for predicted risk estimates. Incorporating patient background and individual characteristics in the enhanced model contributed significantly to baseline risk estimation by capturing ecological heterogeneity. The model’s limitations are interactions between predictors and the lack of external cross-validation.
Bayesian multiple imputation with logistic regression addresses missing data in clinical research Austin et al. (2021) by classifying and analyzing missing values arising from (i) patients’ refusal, (ii) loss to follow-up, (iii) investigator or mechanical errors, or (iv) physicians’ choices about ordering investigations. Distinguishing missing at random (MAR), missing not at random (MNAR), and missing completely at random (MCAR) is crucial in data analysis. Multiple imputation (MI) (available in R, SAS, or Stata) generates plausible values, creates multiple completed datasets, conducts identical statistical analyses across them, and pools the results to classify patients’ health status and mortality rates.
Aims
The study aims to apply Bayesian logistic regression to predict diabetes status and to evaluate the associations between body mass index (BMI), age (≥20 years), gender, and race as predictors, using a retrospective dataset from the 2013–2014 NHANES survey. NHANES employs a complex sampling design, including stratification, clustering, and oversampling of specific population subgroups, rather than uniform random sampling. A Bayesian analytical approach is used to address challenges posed by dataset anomalies such as missing data, complete case analysis, and separation that limit the efficiency and reliability of traditional logistic regression in predicting health outcomes.
Method and Data Preparation
Statistical Tool: we use R and its packages and libraries to import data and to perform data wrangling and analysis.
Data source: NHANES two-year, cross-sectional weighted data (2013–2014 cycle) Center for Health Statistics (1999). Three datasets (demographics, exam, questionnaire) were imported (the haven package converts .XPT files into R data frames), and after selecting the variables of interest a merged dataset was created.
Modeling - We conduct Bayesian logistic regression to estimate the associations of BMI, age, sex, and race/ethnicity with doctor-diagnosed diabetes (DIQ010) and to predict diabetes status.
Data pre-processing and cleaning - Subsets with the selected variables are created from the original weighted datasets (demographics, exam, questionnaire) and merged on the participant ID to create a single dataframe.
Response Variable: binary, doctor-diagnosed type 2 diabetes (excluding gestational diabetes), from DIQ010 (“Doctor told you have diabetes?”) combined with DIQ050, a secondary variable describing treatment status (insulin use), used to exclude gestational cases.
Predictor Variable: Body Mass Index (factor, 4 levels, analyzed after standardization).
o Underweight (<5th percentile)
o Normal (5th–<85th percentile)
o Overweight (85th–<95th percentile)
o Obese (≥95th percentile)
o Missing: kept as-is, since categorization provides clinically interpretable groups
Covariates:
Gender (factor, 2 levels): Male, Female
Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race (including multi-racial)
Age (numeric, continuous)
The merged dataset was created, cleaned, and visualized to check for anomalies and for data exploration.
Code
# check structure and outcome counts
str(merged_data)
cat("Count with DIQ010 in {1,2}:",
    sum(merged_data$DIQ010 %in% c(1, 2), na.rm = TRUE), "\n")
Count with DIQ010 in {1,2}: 9578
Code
# ---- Save to file for reuse ----
dir.create("data", showWarnings = FALSE, recursive = TRUE)
saveRDS(merged_data, "data/merged_2013_2014.rds")
message("Saved: data/merged_2013_2014.rds")
EDA
We used library(survey) to obtain weighted means and SDs of the variables. BMI and age were standardized.
Age was recoded into new variables, and only adults (≥20 years) were included in the analysis.
BMI was recoded and categorized as <18.5, 18.5–<25, 25–<30, 30–<35, 35–<40, ≥40 (kg/m²).
Since the special codes are not random, they cannot simply be dropped; ignoring informative missingness (MAR or MNAR) could introduce bias. We recoded the special codes (3, 7, 9) to NA and retained all NAs in the analysis. A visualization of the missing data is presented below.
A final analytic dataset (‘adult’) was created with “NH White” and “Male” as the reference groups.
Code
# ---------------- Basic Exploration (adults) ----------------
# Keep adults only and build analysis variables
adult <- merged_data %>%
  dplyr::filter(RIDAGEYR >= 20) %>%
  dplyr::transmute(
    # --- keep survey design variables so svydesign() can see them ---
    SDMVPSU, SDMVSTRA, WTMEC2YR,
    # --- outcome: DIQ010 (1 yes, 2 no; 3/7/9 -> NA) ---
    diabetes_dx = dplyr::case_when(
      DIQ010 == 1 ~ 1,
      DIQ010 == 2 ~ 0,
      DIQ010 %in% c(3, 7, 9) ~ NA_real_,
      TRUE ~ NA_real_
    ),
    # --- predictors (raw) ---
    bmi = BMXBMI,
    age = RIDAGEYR,
    # sex (1=Male, 2=Female)
    sex = forcats::fct_recode(factor(RIAGENDR), Male = "1", Female = "2"),
    # race (5-level)
    race = forcats::fct_recode(
      factor(RIDRETH1),
      "Mexican American" = "1",
      "Other Hispanic"   = "2",
      "NH White"         = "3",
      "NH Black"         = "4",
      "Other/Multi"      = "5"
    ),
    # keep DIQ050 so we can safely reference it (may be absent/NA in some rows)
    DIQ050 = DIQ050
  ) %>%
  # standardize continuous predictors
  dplyr::mutate(
    age_c = as.numeric(scale(age)),
    bmi_c = as.numeric(scale(bmi)),
    bmi_cat = cut(
      bmi,
      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
      labels = c("<18.5", "18.5–<25", "25–<30", "30–<35", "35–<40", "≥40"),
      right = FALSE
    )
  ) %>%
  # adjust outcome: if female & DIQ050==1 ("only when pregnant"), set to 0 (not diabetes)
  dplyr::mutate(
    diabetes_dx = ifelse(sex == "Female" & !is.na(DIQ050) & DIQ050 == 1,
                         0, diabetes_dx)
  )

# Make NH White the reference level for race (clearer interpretation)
adult <- adult %>%
  dplyr::mutate(race = forcats::fct_relevel(race, "NH White"))

# --- sanity checks ---
cat("Adults n =", nrow(adult), "\n")
Adults n = 5769
Code
# data exploration
print(table(adult$diabetes_dx, useNA = "ifany"))
0 1 <NA>
4974 618 177
Code
print(table(adult$sex, useNA = "ifany"))
Male Female
2758 3011
Code
print(table(adult$race, useNA = "ifany"))
NH White Mexican American Other Hispanic NH Black
2472 767 508 1177
Other/Multi
845
Code
if (sum(!is.na(adult$diabetes_dx)) == 0) {
  stop("Too few non-missing outcomes for modeling (n = 0). Check DIQ010 upstream.")
}
# (optional plots omitted for brevity)
# save for downstream
if (!dir.exists("data")) dir.create("data", recursive = TRUE)
saveRDS(adult, "data/adult_cleaned_2013_2014.rds")
Code
# ---------------- Survey Design ----------------
# Use exam weights because BMI (BMXBMI) is an MEC variable
nhanes_design_adult <- survey::svydesign(
  id      = ~SDMVPSU,
  strata  = ~SDMVSTRA,
  weights = ~WTMEC2YR,
  nest    = TRUE,
  data    = adult
)
# quick weighted checks
survey::svymean(~age, nhanes_design_adult, na.rm = TRUE)
Multiple logistic regression on the survey-weighted dataset
- We fit a frequentist multiple logistic regression on the survey-weighted dataset as a complete-case analysis and for data exploration.
Multivariate Imputation by Chained Equations (MICE) - We then conducted MICE to manage the missing data. Considering the small sample size, MICE was used to handle missingness before the Bayesian analysis Buuren and Groothuis-Oudshoorn (2011).
Multiple imputation (MI) is an alternative analytic approach for small datasets with missingness.
Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples of n > 400, even with high percentages (75%) of missing data in one variable Van Buuren (2012).
Multiple imputation can be performed using the mice package in R, which adds sampling variability to the imputations.
Iterative MICE imputes the missing values of one variable at a time, using regression models based on the other variables in the dataset.
In the chained process, each imputed variable becomes a predictor for the subsequent imputation, and the entire process is repeated to create several complete datasets, each reflecting different possibilities for the missing data.
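As an illustrative sketch of the chained-equations workflow (assuming the `adult` data frame from the preparation step; `m`, `maxit`, and the seed are arbitrary choices, not the study's settings):

```r
library(mice)

# Impute m = 5 completed datasets; each incomplete variable is imputed
# in turn from the others (predictive mean matching by default)
imp <- mice(adult, m = 5, maxit = 10, seed = 123, printFlag = FALSE)

# Fit the same logistic regression on every completed dataset
fits <- with(imp, glm(diabetes_dx ~ age_c + bmi_c + sex + race,
                      family = binomial()))

# Pool the m sets of estimates with Rubin's rules
summary(pool(fits))
```

pool() combines within- and between-imputation variance (Rubin's rules), so the pooled standard errors reflect the extra uncertainty introduced by imputation.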
Bayesian Logistic Regression
Bayesian statistics is about updating beliefs with evidence:
Posterior ∝ Likelihood × Prior
Prior (p(θ)): Your initial belief about a parameter before seeing the data.
Likelihood (p(y|θ)): How probable the observed data are given the parameters. This is derived from the model (e.g., logistic regression likelihood).
Posterior (p(θ|y)): Your updated belief about the parameter after seeing the data.
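As a toy illustration (not part of the analysis), grid approximation makes the updating rule concrete for a single Bernoulli probability, here a hypothetical disease prevalence p:

```r
# Toy example: posterior for a prevalence p after observing 6 cases in 40 people
p_grid <- seq(0, 1, length.out = 1000)
prior  <- dbeta(p_grid, 2, 2)                  # mildly informative prior belief
lik    <- dbinom(6, size = 40, prob = p_grid)  # likelihood of the observed data
post   <- prior * lik                          # posterior ∝ likelihood × prior
post   <- post / sum(post)                     # normalize over the grid

# Posterior mean and an approximate 95% credible interval
post_mean <- sum(p_grid * post)
ci <- c(p_grid[which(cumsum(post) >= 0.025)[1]],
        p_grid[which(cumsum(post) >= 0.975)[1]])
round(c(mean = post_mean, ci), 3)
```

The same logic underlies the regression model below; MCMC simply replaces the grid when the parameter space is high-dimensional.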
We selected Bayesian logistic regression because the response variable is a binary outcome (diabetes: yes/no).
Bayesian logistic regression combines a binomial likelihood with Bayes’ rule to predict the probability of the disease outcome.
The model assumes a linear relation between the predictors (age, race, BMI, gender) and the log-odds of the response variable (diabetes).
Observations are assumed to be independent of one another.
Regression of a discrete variable (0 or 1) uses a Bernoulli probability model that classifies the categorical response - here, predicting diabetes.
The logit link maps the linear predictor onto probabilities for the response variable.
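To make the link concrete, the inverse logit (plogis in R) converts any linear predictor into a probability; the numbers below are purely illustrative, not the fitted coefficients:

```r
# Inverse logit: probability = exp(eta) / (1 + exp(eta)) = plogis(eta)
# Hypothetical linear predictor: intercept -2.7, +1 SD of age (coef 1.1), mean BMI
eta <- -2.7 + 1.1 * 1 + 0.63 * 0
plogis(eta)   # ≈ 0.168, the implied probability of diabetes

# On the odds scale, a coefficient b corresponds to an odds ratio exp(b)
exp(1.1)      # ≈ 3.0, odds ratio per 1-SD increase under this illustrative coefficient
```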
We use weakly informative priors: Normal(0, 2.5) for the logistic regression coefficients, and a wide prior for the intercept (the fitted model uses brms’s default student_t(3, 0, 10), as shown in prior_summary below). This allows a broad range of baseline log-odds, helps convergence, and avoids extreme estimates - a good default for most applications in social, health, or epidemiological studies.
In Bayesian statistics, every unknown parameter (such as a regression coefficient, mean, or variance) is treated as a random variable with a probability distribution that reflects uncertainty.
We then summarize the Bayesian results (posterior means, credible intervals, etc.).
# Modeling
library(broom)
library(mice)
library(brms)
library(posterior)
library(bayesplot)
library(knitr)

# --- Guardrails for modeling ---
n_outcome <- sum(!is.na(adult$diabetes_dx))
if (n_outcome == 0) stop("Too few non-missing outcomes for modeling. n = 0")

# Ensure factors and >=2 observed levels among complete outcomes
adult <- adult %>%
  dplyr::mutate(
    sex  = if (!is.factor(sex)) factor(sex) else sex,
    race = if (!is.factor(race)) factor(race) else race
  )
if (nlevels(droplevels(adult$sex[!is.na(adult$diabetes_dx)])) < 2)
  stop("sex has <2 observed levels after filtering; check data availability.")
if (nlevels(droplevels(adult$race[!is.na(adult$diabetes_dx)])) < 2)
  stop("race has <2 observed levels after filtering; check Data Prep.")

# ------------------- 1) Survey-weighted complete-case -------------------
# Build a logical filter on the original adult data (same length as design$data)
keep_cc <- with(
  adult,
  !is.na(diabetes_dx) & !is.na(age_c) & !is.na(bmi_c) & !is.na(sex) & !is.na(race)
)

# Subset the survey design using the logical vector (same length as original)
des_cc <- subset(nhanes_design_adult, keep_cc)

# Corresponding complete-case data (optional)
cc <- adult[keep_cc, ] |> droplevels()
cat("\nComplete-case N for survey-weighted model:", nrow(cc), "\n")
Complete-case N for survey-weighted model: 5349
Code
print(table(cc$race))
NH White Mexican American Other Hispanic NH Black
2293 713 470 1101
Other/Multi
772
Code
print(table(cc$diabetes_dx))
0 1
4752 597
Code
print(table(cc$sex))
Male Female
2551 2798
Code
form_cc <- diabetes_dx ~ age_c + bmi_c + sex + race
svy_fit <- survey::svyglm(formula = form_cc, design = des_cc,
                          family = quasibinomial())
summary(svy_fit)
Call:
svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
Survey design:
subset(nhanes_design_adult, keep_cc)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.67143 0.11935 -22.383 1.68e-08 ***
age_c 1.10833 0.05042 21.981 1.94e-08 ***
bmi_c 0.63412 0.05713 11.099 3.88e-06 ***
sexFemale -0.63844 0.10926 -5.843 0.000386 ***
raceMexican American 0.71091 0.13681 5.196 0.000826 ***
raceOther Hispanic 0.46469 0.13474 3.449 0.008712 **
raceNH Black 0.51221 0.15754 3.251 0.011677 *
raceOther/Multi 0.84460 0.17756 4.757 0.001433 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.8455444)
Number of Fisher Scoring iterations: 6
Code
summary(bayes_fit)
Family: bernoulli
Links: mu = logit
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race
Data: adult_imp1 (Number of observations: 5592)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.66 0.09 -2.84 -2.50 1.00 4187 3510
age_c 1.09 0.06 0.97 1.22 1.00 3012 3098
bmi_c 0.63 0.05 0.53 0.72 1.00 3472 3315
sexFemale -0.66 0.10 -0.86 -0.46 1.00 4003 3052
raceMexicanAmerican 0.69 0.18 0.35 1.04 1.00 3526 2843
raceOtherHispanic 0.43 0.24 -0.07 0.89 1.00 4058 3114
raceNHBlack 0.54 0.15 0.24 0.82 1.00 3597 3177
raceOtherDMulti 0.82 0.19 0.45 1.19 1.00 3763 3257
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
prior_summary(bayes_fit)
                prior     class                coef group resp dpar nlpar lb ub       source
       normal(0, 2.5)         b                                                         user
       normal(0, 2.5)         b               age_c                            (vectorized)
       normal(0, 2.5)         b               bmi_c                            (vectorized)
       normal(0, 2.5)         b raceMexicanAmerican                            (vectorized)
       normal(0, 2.5)         b         raceNHBlack                            (vectorized)
       normal(0, 2.5)         b     raceOtherDMulti                            (vectorized)
       normal(0, 2.5)         b   raceOtherHispanic                            (vectorized)
       normal(0, 2.5)         b           sexFemale                            (vectorized)
  student_t(3, 0, 10) Intercept                                                        user
Code
library(ggplot2)
# Example: draws from the priors for two coefficients (age and bmi);
# columns are named to match the posterior draws used later (term/estimate/type)
prior_draws <- tibble(
  term     = rep(c("Age (per 1 SD)", "BMI (per 1 SD)"), each = 4000),
  estimate = c(rnorm(4000, 0, 2.5), rnorm(4000, 0, 2.5)),
  type     = "Prior"
)
ggplot(prior_draws, aes(x = estimate, fill = term)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Prior Distributions for Coefficients",
       x = "Coefficient Value", y = "Density") +
  scale_fill_manual(values = c("skyblue", "orange"))
Code
# Diabetes vs BMI
library(ggplot2)
# Create the plot
p3 <- ggplot(adult_imp1, aes(x = factor(diabetes_dx), y = bmi,
                             fill = factor(diabetes_dx))) +
  geom_boxplot(alpha = 0.7) +
  scale_x_discrete(labels = c("0" = "No Diabetes", "1" = "Diabetes")) +
  labs(x = "Diabetes Diagnosis", y = "BMI",
       title = "BMI Distribution by Diabetes Status") +
  theme_minimal() +
  theme(legend.position = "none")
# Save the plot as an image file (PNG)
ggsave("BMI_Distribution_by_Diabetes_Status.png", plot = p3,
       width = 7, height = 5, dpi = 300)
p3
Code
# logistic regression curve
p4 <- ggplot(adult_imp1, aes(x = bmi, y = diabetes_dx)) +
  geom_point(aes(y = diabetes_dx), alpha = 0.2,
             position = position_jitter(height = 0.02)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = TRUE, color = "blue") +
  labs(x = "BMI", y = "Probability of Diabetes",
       title = "Predicted Probability of Diabetes vs BMI") +
  theme_minimal()
ggsave("Predicted Probability of Diabetes vs BMI.png", plot = p4,
       width = 7, height = 5, dpi = 300)
p4
Once we obtain posterior draws, we examine:
- Summary stats (mean, median, 95% credible intervals): summary(bayes_fit) or posterior_summary(bayes_fit)
- Visualization of distribution shape: mcmc_hist(posterior, pars = c(“b_age”)) or mcmc_areas()
- Pairwise plots of correlations between parameters: mcmc_pairs(posterior)
- Posterior predictive checks comparing model predictions vs observed data: pp_check(bayes_fit)
- Model comparison using LOO or WAIC: loo(bayes_fit) or waic(bayes_fit)
Assumption checks for Bayesian logistic regression:
- posterior predictive checks
- plots for linearity
- MCMC trace plots for convergence
- bayes_R2 for model fit
# Alternative PP plots (histogram / barplot) for binary outcome
# (bar chart preferred for a discrete outcome)
p7 <- ppc_bars(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:50, ]) +
  labs(title = "Posterior Predictive Check: Barplot of Counts") +
  theme_minimal()
ggsave("ppc_bars.png", plot = p7, width = 8, height = 6, dpi = 300)
p7
Code
# PP check for proportions (useful for binary): mean comparison
# check whether the simulated means match the observed mean
p8 <- ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ],
               stat = "mean") +
  labs(title = "Posterior Predictive Check: Mean of Replicates") +
  theme_minimal()
ggsave("ppc_stat_mean.png", plot = p8, width = 8, height = 6, dpi = 300)
p8
Code
## sd
p9 <- ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ],
               stat = "sd") +
  labs(title = "PPC: Standard Deviation of Replicates") +
  theme_minimal()
ggsave("ppc_stat_sd.png", plot = p9, width = 8, height = 6, dpi = 300)
p9
library(posterior)
library(bayesplot)
# Extract posterior draws as a draws_df (simulated posterior outcomes)
post <- as_draws_df(bayes_fit)
# Check parameter names
colnames(post)
# Density overlays for the regression coefficients
p11 <- mcmc_areas(post, pars = c(
  "b_age_c", "b_bmi_c", "b_sexFemale",
  "b_raceMexicanAmerican", "b_raceOtherHispanic",
  "b_raceNHBlack", "b_raceOtherDMulti"
))
ggsave("mcmc_areas.png", plot = p11, width = 8, height = 6, dpi = 300)
p11
# Combine observed and predicted into one data frame
plot_data <- adult_imp1 %>%
  mutate(
    predicted_bmi = predicted[, "Estimate"],
    lower_ci  = predicted[, "Q2.5"],
    upper_ci  = predicted[, "Q97.5"],
    obs_index = 1:nrow(adult_imp1)  # index for x-axis
  )
# Line plot
p13 <- ggplot(plot_data, aes(x = obs_index)) +
  geom_line(aes(y = bmi, color = "Observed")) +              # observed BMI
  geom_line(aes(y = predicted_bmi, color = "Predicted")) +   # predicted BMI
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci), alpha = 0.2) +  # uncertainty
  labs(x = "Observation", y = "BMI", color = "Legend") +
  theme_minimal()
ggsave("Line plot_bmi_obs_pred.png", plot = p13, width = 8, height = 6, dpi = 300)
p13
Code
summary(adult_imp1$bmi)
Min. 1st Qu. Median Mean 3rd Qu. Max.
14.1 24.1 27.7 29.0 32.4 82.9
Code
summary(plot_data$bmi_c)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.09476 -0.69669 -0.19338 -0.01133 0.46371 7.52398
term estimate type
Length:8000 Min. :-3.585220 Length:8000
Class :character 1st Qu.:-0.696978 Class :character
Mode :character Median : 0.000572 Mode :character
Mean :-0.007125
3rd Qu.: 0.668627
Max. : 3.922817
library(brms)
library(tidyr)
# Extract posterior draws
post <- as_draws_df(bayes_fit) %>%   # bayes_fit = your brms model
  select(b_bmi_c, b_age_c) %>%       # select the coefficient columns
  pivot_longer(everything(),
               names_to  = "term",
               values_to = "estimate") %>%
  mutate(
    term = case_when(
      term == "b_bmi_c" ~ "BMI (per 1 SD)",
      term == "b_age_c" ~ "Age (per 1 SD)"
    ),
    type = "Posterior"
  )

## visualization of prior and posterior draws
combined_draws <- bind_rows(prior_draws, post)

library(ggplot2)
p14 <- ggplot(combined_draws, aes(x = estimate, fill = type)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ term, scales = "free", ncol = 2) +
  theme_minimal(base_size = 13) +
  labs(title = "Prior vs Posterior Distributions",
       x = "Coefficient estimate", y = "Density", fill = "")
p14
Code
ggsave("Prior vs Posterior Distributions_bmi_age.png", plot = p14,
       width = 8, height = 6, dpi = 300)

#### Compute proportion of diabetes = 1 for each draw
pp_proportion <- rowMeans(pp_samples)  # proportion of 1's in each posterior draw
# Summary of posterior proportions
summary(pp_proportion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09371 0.10479 0.10908 0.10923 0.11373 0.13215
Code
# Optional: visualize the posterior probability distribution
pp_proportion_df <- tibble(proportion = pp_proportion)
p15 <- ggplot(pp_proportion_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
  labs(title = "Posterior Distribution of Proportion of Diabetes = 1",
       x = "Proportion of Diabetes = 1", y = "Frequency") +
  theme_minimal()
p15
Code
ggsave("Posterior Distribution of Proportion of Diabetes = 1.png", plot = p15, width =8, height =6, dpi =300)
Code
library(tidyverse)
# Posterior predicted proportion vector
# pp_proportion <- rowMeans(pp_samples)  # if not already done
known_prev <- 0.089  # NHANES prevalence

# Posterior summary
posterior_mean <- mean(pp_proportion)
posterior_ci   <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# Create a data frame for plotting
pp_df <- tibble(proportion = pp_proportion)

# Plot
p16 <- ggplot(pp_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.005, fill = "skyblue", color = "black") +
  geom_vline(xintercept = known_prev, color = "red",
             linetype = "dashed", size = 1) +
  geom_vline(xintercept = posterior_mean, color = "blue",
             linetype = "solid", size = 1) +
  geom_rect(aes(xmin = posterior_ci[1], xmax = posterior_ci[2],
                ymin = 0, ymax = Inf),
            fill = "blue", alpha = 0.1, inherit.aes = FALSE) +
  labs(
    title = "Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    subtitle = paste0("Red dashed = NHANES prevalence (", known_prev,
                      "), Blue solid = Posterior mean (",
                      round(posterior_mean, 3), ")"),
    x = "Proportion of Diabetes = 1", y = "Frequency"
  ) +
  theme_minimal()
ggsave("Posterior Distribution of Proportion of Diabetes = 1.png", plot = p16,
       width = 8, height = 6, dpi = 300)
p16
Code
library(dplyr)
# Posterior predicted proportion
posterior_mean <- mean(pp_proportion)
posterior_ci   <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# NHANES prevalence with SE from survey::svymean
# Suppose you already have:
# svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
known_prev <- 0.089    # Mean prevalence
known_se   <- 0.0048   # Standard error from survey

# Calculate 95% confidence interval
known_ci <- c(known_prev - 1.96 * known_se,
              known_prev + 1.96 * known_se)

# Print results
data.frame(
  Type     = c("Posterior Prediction", "NHANES Prevalence"),
  Mean     = c(posterior_mean, known_prev),
  Lower_95 = c(posterior_ci[1], known_ci[1]),
  Upper_95 = c(posterior_ci[2], known_ci[2])
)
Type Mean Lower_95 Upper_95
2.5% Posterior Prediction 0.109235 0.0975456 0.1207082
NHANES Prevalence 0.089000 0.0795920 0.0984080
Code
library(ggplot2)
library(dplyr)
# Create a data frame for plotting
ci_df <- data.frame(
  Type     = c("Posterior Prediction", "NHANES Prevalence"),
  Mean     = c(0.1096674, 0.089),
  Lower_95 = c(0.09772443, 0.079592),
  Upper_95 = c(0.1210658, 0.098408)
)

# Plot
p17 <- ggplot(ci_df, aes(x = Type, y = Mean, color = Type)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = Lower_95, ymax = Upper_95), width = 0.2) +
  ylim(0, max(ci_df$Upper_95) + 0.02) +
  labs(
    title = "Comparison of Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    y = "Proportion of Diabetes", x = ""
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")
ggsave("Population Prevalence vs Proportion of Diabetes = 1.png", plot = p17,
       width = 8, height = 6, dpi = 300)
p17
Results
Survey-Weighted Logistic Regression
Model: svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
All predictors are significant: age (p < 0.001) and BMI (p < 0.001) show strong positive associations with the outcome, while female sex (p = 0.0004) is negatively associated. Race effects are also significant: Mexican American (p = 0.0008), Other Hispanic (p = 0.0087), NH Black (p = 0.0117), and Other/Multi (p = 0.0014).
MICE
All predictors are statistically significant.
Positive associations: age (p < 0.001), BMI (p < 0.001), and all race categories relative to the reference group.
Rhat = 1.00 for all parameters → excellent convergence.
Bulk_ESS / Tail_ESS: large values (> 2000) → high effective sample sizes and reliable posterior estimates.
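For readers who want to see what Rhat measures, the following is a minimal sketch of the split-chain version of the diagnostic, applied to simulated, well-mixed chains rather than the actual brms draws (the real values reported above come from brms/rstan internally).

```r
# Minimal split-chain R-hat sketch (Gelman-Rubin diagnostic).
# `good` is simulated stand-in data, not draws from the fitted model.
set.seed(42)

split_rhat <- function(draws) {
  # draws: matrix of posterior draws, one column per chain
  half <- nrow(draws) %/% 2
  # Split each chain in half so within-chain trends inflate R-hat
  chains <- cbind(draws[1:half, , drop = FALSE],
                  draws[(half + 1):(2 * half), , drop = FALSE])
  m <- ncol(chains); n <- nrow(chains)
  chain_means <- colMeans(chains)
  B <- n * var(chain_means)          # between-chain variance
  W <- mean(apply(chains, 2, var))   # within-chain variance
  var_plus <- ((n - 1) / n) * W + B / n
  sqrt(var_plus / W)                 # ~1 when chains agree
}

# Four well-mixed chains from the same distribution -> R-hat near 1
good <- matrix(rnorm(4000), ncol = 4)
round(split_rhat(good), 2)
```

Values near 1.00, as in the table above, indicate the chains are sampling the same distribution; values above roughly 1.01 would warrant longer runs.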
Interpretation
Strong predictors: Age and BMI are strongly positively associated with diabetes risk.
Sex effect: Females have a lower probability of diabetes compared to males.
R² = 0.13: the predictors (age, BMI, sex, race) explain about 13% of the variance in the outcome (diabetes_dx). The 95% credible interval of 0.106–0.156 indicates that, given the model and data, the true proportion of variance explained plausibly lies between 10.6% and 15.6%, reflecting the uncertainty in explained variance that is natural for probabilistic models.
Est.Error = 0.0127 reflects the standard deviation of the R² estimate across posterior samples; this small value indicates the R² estimate is fairly precise.
Race/ethnicity: Mexican American, NH Black, and Other/Multi groups have higher odds of diabetes. The Other Hispanic group has a less certain effect.
age_c: Each 1-unit increase in centered age increases the log-odds of diabetes by 1.09, a strong positive effect.
bmi_c: Higher BMI is associated with higher diabetes risk.
sexFemale: Females have lower odds of diabetes compared to males.
raceMexican American: Higher odds of diabetes vs. the reference race (likely NH White).
raceOther Hispanic: Slightly higher odds vs. the reference, but the interval crosses zero, so the effect is uncertain.
raceNH Black: Significantly higher odds of diabetes compared to the reference.
raceOther/Multi: Higher odds of diabetes vs. the reference group.
Posterior distribution of all parameters in the model: density plots smooth the posterior samples for each parameter (e.g., intercept, slope) into curves, showing where most of the posterior probability mass lies and conveying both the best estimate and its uncertainty.
Posterior Predictive Distribution: generated from posterior predictive draws, y_rep ~ p(y_new | θ), which simulate new data given the posterior parameter estimates. Posterior predictive checks (PPC) compare these simulations to the real data to assess model fit.
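The posterior predictive draw y_rep ~ p(y_new | θ) can be sketched with simulated stand-ins; `theta_draws` below is a fabricated posterior for the prevalence parameter, not output from the fitted brms model.

```r
# Hedged sketch of a posterior predictive check for a binary outcome.
set.seed(1)
n_obs <- 500
theta_draws <- rbeta(2000, 55, 445)  # stand-in posterior for P(diabetes = 1)

# One replicated data set per posterior draw: y_rep ~ p(y_new | theta)
pp_proportion_sim <- vapply(
  theta_draws,
  function(theta) mean(rbinom(n_obs, 1, theta)),
  numeric(1)
)

# Compare replicated proportions to an observed value (0.11, illustrative)
observed <- 0.11
c(mean  = mean(pp_proportion_sim),
  lower = unname(quantile(pp_proportion_sim, 0.025)),
  upper = unname(quantile(pp_proportion_sim, 0.975)),
  ppp   = mean(pp_proportion_sim >= observed))  # posterior predictive p-value
```

If the observed proportion sits well inside the replicated distribution (posterior predictive p-value far from 0 or 1), the model reproduces this feature of the data adequately.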
Incorporating Uncertainty: two sources of uncertainty are combined. Parameter uncertainty is captured in the posterior distributions; predictive uncertainty is captured in the posterior predictive draws.
Combining the two provides credible intervals for predictions rather than point predictions, supporting statements such as: "Given this BMI, the probability of diabetes is likely between 40% and 55%."
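A statement like "given this BMI, the probability is likely between 40% and 55%" comes from pushing posterior coefficient draws through the inverse-logit link. The coefficient draws below are simulated stand-ins, not draws from the fitted model.

```r
# Sketch: a credible interval for the predicted probability at a given BMI.
set.seed(7)
n_draws  <- 4000
beta0    <- rnorm(n_draws, -2.2, 0.10)  # intercept draws (illustrative)
beta_bmi <- rnorm(n_draws, 0.55, 0.08)  # per-SD BMI effect draws (illustrative)

bmi_z <- 1.5  # a person 1.5 SD above mean BMI

# One predicted probability per posterior draw via the inverse-logit link
p_draws <- plogis(beta0 + beta_bmi * bmi_z)

# 95% credible interval for the predicted probability
round(quantile(p_draws, c(0.025, 0.5, 0.975)), 3)
```

The interval width reflects both parameter uncertainty and the nonlinearity of the link; a point prediction would hide this.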
Comparing Models
All three models (survey-weighted MLE, multiple imputation, Bayesian) agree closely on the direction and magnitude of the effects of BMI and age.
Age is a stronger predictor than BMI, nearly tripling the odds per 1 SD.
BMI significantly increases diabetes risk (~1.7–1.9× per 1 SD).
Differences between models are minor, indicating robust and reliable findings despite missing data or modeling approach.
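As a quick arithmetic check, the log-odds coefficients quoted above convert to the reported odds ratios via exponentiation; the BMI log-odds range used below (0.53–0.64) is an assumed illustration that implies the ~1.7–1.9x figure.

```r
# Converting log-odds coefficients to odds ratios.
or_age <- exp(1.09)           # age coefficient of 1.09 per 1 SD
or_bmi <- exp(c(0.53, 0.64))  # assumed BMI log-odds range per 1 SD

round(or_age, 2)  # ~2.97: "nearly tripling the odds per 1 SD"
round(or_bmi, 2)  # ~1.70 to 1.90: "1.7-1.9x per 1 SD"
```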
Diabetes Predicted Proportion vs Population Prevalence
The posterior predicted proportion is slightly higher than the observed NHANES prevalence (10.97% vs 8.9%).
Intervals overlap slightly, but the posterior tends to overestimate diabetes compared to NHANES.
The difference could be due to:
Model assumptions (logistic link, priors)
Predictor effects (BMI, age, sex, race)
Sample characteristics vs population weighting in NHANES
The results suggest the model is reasonable but errs on the side of predicting slightly higher risk than the observed population prevalence.
Conclusion
Across multiple modeling approaches (survey-weighted maximum likelihood, multiple imputation, and Bayesian regression) both age and BMI were consistently strong predictors of diabetes. Each standard deviation increase in age nearly tripled the odds of diabetes, while a similar increase in BMI elevated the odds by approximately 1.7–1.9 times. The consistency of these results across models highlights the robustness of the associations and underscores the importance of age and BMI as key risk factors for diabetes in this population.
Effect stability: point estimates from the Bayesian model closely aligned with those from the frequentist models, indicating that prior regularization stabilized the estimates in the presence of modest missingness.
Uncertainty quantification: Bayesian credible intervals for the odds ratios were slightly narrower yet overlapped the frequentist confidence intervals, suggesting comparable inferential precision with improved interpretability.
Design considerations:
Survey-weighted MLE (maximum likelihood estimation) weights each observation according to its survey weight, so estimates reflect population-level parameters rather than the sample alone, producing population-representative estimates.
The Bayesian model with normalized weights, instead of fully modeling the survey design, used normalized sampling weights as importance weights. Scaling the weights to sum to the sample size approximates the effect of survey weights but does not fully account for stratification, clustering, or design-based variance adjustments; the Bayesian inference treats the weighted likelihood as coming from a regular model, ignoring some survey design features.
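The weight normalization described here can be sketched in a few lines; `w_raw` below is an illustrative vector of raw NHANES-style weights, not actual survey data.

```r
# Normalizing survey weights so they sum to the sample size,
# before passing them to the Bayesian model as importance weights.
w_raw <- c(12000, 8500, 30000, 4200, 15300)  # illustrative raw weights

n <- length(w_raw)
w_norm <- w_raw * n / sum(w_raw)  # rescale: sum(w_norm) == n

sum(w_norm)  # equals the sample size
```

Unscaled raw weights would make the likelihood behave as if the sample were tens of thousands of observations, artificially shrinking posterior intervals; normalization avoids that, though it still ignores stratification and clustering.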
Discussions
The use of multiple imputation allowed for robust analysis despite missing data, increasing power and reducing bias. Comparison of frequentist and Bayesian models demonstrated consistency in significant predictors, while the Bayesian approach provided the advantage of posterior distributions and probabilistic interpretation.
Across all models, both age and BMI emerged as strong and consistent predictors of diabetes, and this consistency across modeling approaches strengthens the validity of the findings. Multiple imputation accounted for potential biases due to missing data, and Bayesian modeling produced robust credible intervals that closely matched the frequentist estimates. These results align with previous epidemiological research indicating that increasing age and higher BMI are among the most important determinants of type 2 diabetes risk, reflecting cumulative exposure to metabolic and lifestyle risk factors over time and the role of excess adiposity and insulin-related effects.
The survey-weighted dataset ensured population representativeness, multiple imputation handled missing data, and rigorous Bayesian estimation achieved high effective sample sizes; R̂ ≈ 1.00 across parameters confirmed excellent model convergence. Bayesian logistic regression thus provided statistically consistent and interpretable inference, achieving the aim of this study. Future research could fit a hierarchical model across NHANES cycles or add variables (e.g., laboratory tests) to assess nonlinear effects of metabolic risk factors.
Limitations
Our cross-sectional design precludes causal inference and leaves room for residual confounding from unmeasured factors such as diet, physical activity, and genetic predisposition.
Implications
Age and BMI are robust and independent predictors of diabetes, underscoring the importance of early, targeted interventions in mitigating diabetes risk.
Longitudinal studies, and combining other statistical analytical methods with Bayesian inference, can further enhance and better inform precision prevention strategies.
References
Austin, Peter C., Ian R. White, Douglas S. Lee, and Stef van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
National Center for Health Statistics. 2013. “Vital and Health Statistics Reports Series 2, Number 161, September 2013.” National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/diagnostics13193135.
D’Angelo, Gina, and Di Ran. 2025. “Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space.” Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A Tutorial on Bayesian Normal Linear Regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Leeuw, Christiaan de, and Irene Klugkist. 2012. “Augmenting Data with Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Yi Ming, Sam Li Sheng Chen, Amy Ming Fang Yen, and Hsiu Hsi Chen. 2013. “Individual Risk Prediction Model for Incident Cardiovascular Disease: A Bayesian Clinical Reasoning Approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/j.ijcard.2012.05.016.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Vol. 10. Boca Raton, FL: CRC Press.